The goal of this project is to implement a machine learning algorithm that predicts whether a subject is performing a weight lifting exercise correctly.
The Weight Lifting Exercises (WLE) dataset (source: http://groupware.les.inf.puc-rio.br/har) is used in the project. The dataset was collected by recording signals from wearable sensors while the subjects performed weight lifting activities. Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other four classes correspond to common mistakes. More information about the WLE dataset can be found in [1] or at the source page given above. The training data can be downloaded here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The WLE training dataset consists of 19,622 observations of 160 variables. The “classe” variable in the training set (values A, B, C, D or E) is the outcome that the algorithm should predict. The training dataset is partitioned into two parts: 75% is used for model training and cross-validation, and the remaining 25% is held out as test data for estimating the out-of-sample error. The final chosen algorithm is then applied to predict the outcome of 20 separate test cases. The test data can be downloaded here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
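The 75/25 partition described above can be sketched with caret's createDataPartition, which samples within each level of the outcome so that the class proportions are roughly preserved in both parts. This is a minimal sketch and not necessarily the code used in the project: the helper name split_train_test, the seed value and the toy data frame are assumptions made for illustration.

```r
library(caret)

# Hypothetical helper: split a data frame into training and test parts,
# stratified on the outcome column.
split_train_test <- function(data, outcome = "classe", p = 0.75, seed = 1234) {
  set.seed(seed)  # arbitrary seed, assumed here only for reproducibility
  idx <- createDataPartition(data[[outcome]], p = p, list = FALSE)
  list(
    train = data[idx, , drop = FALSE],   # rows selected for training
    test  = data[-idx, , drop = FALSE]   # remaining rows, held out
  )
}

# Toy stand-in for the WLE training data (the real file has 160 columns).
toy <- data.frame(
  roll_belt = rnorm(200),
  classe    = factor(rep(c("A", "B", "C", "D", "E"), each = 40))
)

parts <- split_train_test(toy)
table(parts$train$classe)  # class counts in the training part
table(parts$test$classe)   # class counts in the held-out part
```

With the real data, the same helper could be called on the data frame returned by read.csv for pml-training.csv; the held-out part is then touched only once, to estimate the out-of-sample error.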
The caret package is used in this project to build and evaluate the machine learning algorithms.
Exploratory analysis of the training dataset shows that columns 1 to 7 consist of variables not directly obtained from the wearable sensors. Various plots showing the relationship of these variables to the “classe” variable were examined.
Three of the plots are shown here. The first plot shows that the variable “X” in column 1 is simply a row index of the dataset, which is sorted by the “classe” variable. Columns 2 to 7 contain timestamps and measurement-window data. When these variables are plotted against any sensor-measured variable and grouped by “classe”, they show no meaningful relationship to “classe”; they appear instead to reflect the different times at which the subjects performed the exercises.
Based on these findings, columns 1 to 7 are manually dropped from the feature set.
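Dropping the first seven columns by position can be sketched in base R as follows. The toy data frame is an assumption for illustration: apart from “X” and “classe”, its column names are placeholders for the timestamp and window variables, and roll_belt stands in for the sensor measurements.

```r
# Toy stand-in: seven non-sensor columns followed by one sensor column
# and the outcome. Column names other than X and classe are assumed.
toy <- data.frame(
  X                    = 1:3,
  user_name            = c("s1", "s1", "s2"),
  raw_timestamp_part_1 = c(100L, 101L, 102L),
  raw_timestamp_part_2 = c(10L, 20L, 30L),
  cvtd_timestamp       = c("t1", "t2", "t3"),
  new_window           = c("no", "no", "yes"),
  num_window           = c(11L, 11L, 12L),
  roll_belt            = c(1.41, 1.42, 1.48),
  classe               = factor(c("A", "A", "B"))
)

# Negative indices remove columns 1 to 7 and keep everything else.
features <- toy[, -(1:7), drop = FALSE]
names(features)
```

The same positional drop has to be applied to any data the model later predicts on, so that the training and prediction feature sets match.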